1.
Genome Biol ; 24(1): 274, 2023 Nov 30.
Article in English | MEDLINE | ID: mdl-38037131

ABSTRACT

BACKGROUND: As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. RESULTS: In this work, we collect all publicly available high-quality human haplotypes and construct the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. CONCLUSION: This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.


Subjects
Algorithms , Software , Humans , Sequence Analysis, DNA/methods , Genomics/methods , Genome
2.
NAR Genom Bioinform ; 5(3): lqad074, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37608802

ABSTRACT

Bioinformatics is a field known for the numerous standards and formats that have been developed over the years. This plethora of formats, sometimes complementary and often redundant, poses many challenges to bioinformatics data analysts. They constantly need to find the best tool to convert their data into a suitable format, which is often a complex, technical and time-consuming task. Moreover, these small yet important tasks are often difficult to make reproducible. To overcome these difficulties, we initiated BioConvert, a collaborative project to facilitate the conversion of life science data from one format to another. BioConvert aggregates existing software within a single framework and complements it with original code when needed. It provides a common interface, so that users do not have to learn tens of separate tools. Currently, BioConvert supports about 50 formats and 100 direct conversions in areas such as alignment, sequencing, phylogeny, and variant calling. In addition to being useful for end-users, BioConvert can also be used by developers as a universal benchmarking framework for evaluating and comparing numerous conversion tools. Additionally, we provide a web server implementing a user-friendly online interface to BioConvert, allowing direct use by the community.

3.
Article in English | MEDLINE | ID: mdl-38712341

ABSTRACT

A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead.
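The definition above (a set of k-mers, each carrying a set of colors) can be illustrated with a minimal Python sketch. It is not ESS-color and performs no compression; the function names, the sample names, and the toy sequences are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict


def build_colored_kmer_index(samples, k):
    """Map every k-mer to the set of colors (sample names) that contain it."""
    index = defaultdict(set)
    for color, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(color)
    return dict(index)


def successors(index, kmer):
    """Return the indexed k-mers that overlap `kmer` by k-1 characters."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in index]


# Hypothetical toy input: two samples with one short sequence each.
samples = {"sample_A": ["ACGTAC"], "sample_B": ["CGTACG"]}
index = build_colored_kmer_index(samples, k=4)
```

In this representation the graph edges are implicit: two k-mers are adjacent when they overlap by k-1 characters, which is what `successors` checks. A plain dictionary like this is the uncompressed baseline that indexing and compression tools aim to shrink.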

4.
Bioinformatics ; 38(18): 4423-4425, 2022 09 15.
Article in English | MEDLINE | ID: mdl-35904548

ABSTRACT

SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools. AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Sequence Analysis, DNA , Compact Disks
5.
Nucleic Acids Res ; 48(D1): D465-D469, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31691799

ABSTRACT

Norine, the unique resource dedicated to nonribosomal peptides (NRPs), is now updated with a new pipeline to automate massive sourcing and enhance annotation. External databases are mined to extract NRPs that are not yet in Norine. To maintain high data quality, successive filters are applied to automatically validate the NRP annotations, and only validated data are inserted into the database. External databases were also used to complete annotations of NRPs already in Norine. In addition, consistency checks within Norine and between Norine and external sources have revealed annotation errors. Some can be corrected automatically, while others need manual curation. This new approach led to the insertion of 539 new NRPs and the addition or correction of annotations of nearly all Norine entries. Two new tools, one to analyse the chemical structures of NRPs (rBAN) and one to infer a molecular formula from the mass-to-charge ratio of an NRP (Kendrick Formula Predictor), were also integrated. Norine is freely accessible from the following URL: https://bioinfo.cristal.univ-lille.fr/norine/.


Subjects
Databases, Protein , Peptide Biosynthesis, Nucleic Acid-Independent , Software , Bacterial Proteins/biosynthesis , Bacterial Proteins/chemistry , Fungal Proteins/biosynthesis , Fungal Proteins/chemistry
6.
BMC Bioinformatics ; 20(1): 88, 2019 Feb 19.
Article in English | MEDLINE | ID: mdl-30782112

ABSTRACT

BACKGROUND: High-throughput amplicon sequencing of environmental DNA (eDNA metabarcoding) has become a routine tool for biodiversity surveys and ecological studies. By including sample-specific tags in the primers prior to PCR amplification, it is possible to multiplex hundreds of samples in a single sequencing run. The analysis of millions of sequences spread across hundreds to thousands of samples calls for efficient, automated yet flexible analysis pipelines. Various algorithms and software have been developed to perform one or multiple processing steps, such as paired-end read assembly, chimera filtering, Operational Taxonomic Unit (OTU) clustering and taxonomic assignment. Some of these programs are now well established and widely used by scientists as part of their workflow. Wrappers capable of processing metabarcoding data from raw sequencing data to an annotated OTU-to-sample matrix were also developed to facilitate the analysis for non-specialist users. Yet, most of them require basic bioinformatic or command-line knowledge, which can limit the accessibility of such integrative toolkits. Furthermore, for flexibility reasons, these tools have adopted a step-by-step approach, which can prevent easy automation of the workflow and hence hamper the reproducibility of the analysis. RESULTS: We introduce SLIM, an open-source web application that simplifies the creation and execution of metabarcoding data processing pipelines through an intuitive Graphical User Interface (GUI). The GUI interacts with well-established software and their associated parameters, so that the processing steps are performed seamlessly from the raw sequencing data to an annotated OTU-to-sample matrix. Thanks to a module-centered organization, SLIM can be used for a wide range of metabarcoding cases, and can also be extended by developers for custom needs or for the integration of new software. The pipeline configuration (i.e. the chaining of modules and all their parameters) is stored in a file that can be used for reproducing the same analysis. CONCLUSION: This web application has been designed to be user-friendly for non-specialists yet flexible, with advanced settings and extensibility for advanced users and bioinformaticians. The source code along with full documentation is available on the GitHub repository ( https://github.com/yoann-dufresne/SLIM ) and a demonstration server is accessible through the application website ( https://trtcrd.github.io/SLIM/ ).


Subjects
DNA Barcoding, Taxonomic/methods , Internet , Software , Algorithms , Reproducibility of Results , User-Computer Interface
7.
Mol Ecol Resour ; 18(6): 1381-1391, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30014577

ABSTRACT

Biodiversity monitoring is the standard for environmental impact assessment of anthropogenic activities. Several recent studies showed that high-throughput amplicon sequencing of environmental DNA (eDNA metabarcoding) could overcome many limitations of the traditional morphotaxonomy-based bioassessment. Recently, we demonstrated that supervised machine learning (SML) can be used to predict accurate biotic index values from eDNA metabarcoding data, regardless of the taxonomic affiliation of the sequences. However, it is unknown to what extent the accuracy of such models depends on the taxonomic resolution of molecular markers, or how SML compares with metabarcoding approaches targeting well-established bioindicator species. In this study, we address these issues by training predictive models on five different ribosomal bacterial and eukaryotic markers and measuring their performance in assessing the environmental impact of marine aquaculture on independent data sets. Our results show that all tested markers yield accurate predictive models and that they all outperform the assessment relying solely on taxonomically assigned sequences. Remarkably, we did not find any significant difference in the performance of the models built using universal eukaryotic or prokaryotic markers. Using any molecular marker with a taxonomic range broad enough to comprise different potential bioindicator taxa, the SML approach can overcome the limits of taxonomy-based eDNA bioassessment.


Subjects
Bacteria/classification , Biodiversity , DNA Barcoding, Taxonomic/methods , Environmental Monitoring/methods , Eukaryota/classification , Metagenomics/methods , Supervised Machine Learning , Biomarkers/analysis , Computer Simulation , RNA, Ribosomal/genetics
8.
Bioinformatics ; 34(4): 585-591, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29040406

ABSTRACT

Motivation: Advances in the sequencing of uncultured environmental samples, dubbed metagenomics, raise a growing need for accurate taxonomic assignment. Accurate identification of organisms present within a community is essential to understanding even the most elementary ecosystems. However, current high-throughput sequencing technologies generate short reads that only partially cover full-length marker genes, and this poses difficult bioinformatic challenges for taxonomy identification at high resolution. Results: We designed MATAM, a software tool dedicated to the fast and accurate targeted assembly of short reads sequenced from a genomic marker of interest. The method implements a stepwise process based on the construction and analysis of a read overlap graph. It is applied to the assembly of 16S rRNA markers and is validated on simulated, synthetic and genuine metagenomes. We show that MATAM outperforms other available methods in terms of low error rates and recovered fractions and is suitable to provide improved assemblies for precise taxonomic assignments. Availability and implementation: https://github.com/bonsai-team/matam. Contact: pierre.pericard@gmail.com or helene.touzet@univ-lille1.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
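To make the notion of a read overlap graph concrete, here is a minimal Python sketch under simplifying assumptions: exact suffix-prefix matches only, an all-against-all comparison, and no reference alignment or error tolerance. It is not MATAM's implementation; the function names, the `min_overlap` parameter, and the toy reads are hypothetical.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` characters exists.
    Assumes min_overlap >= 1.
    """
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_overlap):
    """Return directed edges {(i, j): overlap} between read indices."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap_length(a, b, min_overlap)
                if length:
                    graph[(i, j)] = length
    return graph


# Hypothetical toy reads that chain into one longer sequence.
reads = ["ACGTTG", "GTTGCA", "TGCAAT"]
graph = build_overlap_graph(reads, min_overlap=3)
```

Each edge records how many characters the end of one read shares with the start of another, so a path through the graph spells out a candidate assembly. The quadratic comparison is only workable for toy inputs.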


Subjects
Gastrointestinal Microbiome/genetics , High-Throughput Nucleotide Sequencing/methods , Metagenome , Phylogeny , Software , Algorithms , Humans , Metagenomics/methods , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA/methods
9.
Nucleic Acids Res ; 44(D1): D1113-8, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26527733

ABSTRACT

Since its creation in 2006, Norine has remained the unique knowledgebase dedicated to non-ribosomal peptides (NRPs). These secondary metabolites, produced by bacteria and fungi, harbor diverse interesting biological activities (such as antibiotic, antitumor, siderophore or surfactant) directly related to the diversity of their structures. The Norine team's goal is to collect the NRPs and provide tools to analyze them efficiently. We have developed a user-friendly interface and dedicated tools to provide a complete bioinformatics platform. The knowledgebase gathers abundant and valuable annotations on more than 1100 NRPs. To increase the quantity of described NRPs and improve the quality of associated annotations, we are now opening Norine to crowdsourcing. We believe that contributors from the scientific community are the best experts to annotate the NRPs they work on. We have developed MyNorine to facilitate the submission of new NRPs or modifications of stored ones. This article presents MyNorine and other novelties of the Norine interface released since the first publication. Norine is freely accessible from the following URL: http://bioinfo.lifl.fr/NRP.


Subjects
Databases, Chemical , Peptides/chemistry , Peptides/pharmacology , Internet , Knowledge Bases , Molecular Sequence Annotation , Peptides/metabolism
10.
J Cheminform ; 7: 62, 2015.
Article in English | MEDLINE | ID: mdl-26715946

ABSTRACT

BACKGROUND: The monomeric composition of polymers is a powerful basis for structure comparison and synthetic biology, among other applications. Many databases give access to the atomic structure of compounds, but the monomeric structure of polymers is often lacking. We have designed a smart algorithm, implemented in the tool Smiles2Monomers (s2m), to infer efficiently and accurately the monomeric structure of a polymer from its chemical structure. RESULTS: Our strategy is divided into two steps: first, monomers are mapped onto the atomic structure by an efficient subgraph-isomorphism algorithm; second, the best tiling is computed so that non-overlapping monomers cover the whole structure of the target polymer. The mapping is based on a Markovian index built by a dynamic programming algorithm. The index enables s2m to quickly search for all the given monomers in a target polymer. Then, a greedy algorithm combines the mapped monomers into a consistent monomeric structure. Finally, a local branch and cut algorithm refines the structure. We tested this method on two manually annotated databases of polymers and reconstructed the structures de novo with a sensitivity over 90%. The average computation time per polymer is 2 s. CONCLUSION: s2m automatically creates de novo monomeric annotations for polymers, efficiently in terms of computation time and sensitivity. s2m allowed us to detect annotation errors in the tested databases and to easily find the accurate structures. s2m could therefore be integrated into the curation process of databases of small compounds to verify the current entries and accelerate the annotation of new polymers. The full method can be downloaded or accessed via a website for peptide-like polymers at http://bioinfo.lifl.fr/norine/smiles2monomers.jsp.
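The greedy tiling step can be sketched in a few lines of Python. This illustrates the general idea only and is not the s2m code: it assumes the mapping step has already produced candidate matches as sets of atom indices, it ranks candidates by size (an assumption, since the abstract does not state the scoring), and it omits the branch and cut refinement. All names and the toy matches are hypothetical.

```python
def greedy_tiling(matches):
    """Select non-overlapping monomer matches, largest first.

    `matches` is a list of (monomer_name, atom_indices) pairs, where
    atom_indices is the set of polymer atoms covered by that match.
    Returns the chosen matches and the set of covered atoms.
    """
    covered = set()
    tiling = []
    # Largest matches first; ties are broken by monomer name.
    for name, atoms in sorted(matches, key=lambda m: (-len(m[1]), m[0])):
        atoms = frozenset(atoms)
        if covered.isdisjoint(atoms):
            tiling.append((name, atoms))
            covered |= atoms
    return tiling, covered


# Hypothetical candidate matches on a 12-atom polymer.
matches = [
    ("Ala", {0, 1, 2, 3}),
    ("Gly", {2, 3, 4}),
    ("Ser", {4, 5, 6, 7, 8}),
    ("Gly", {9, 10, 11}),
]
tiling, covered = greedy_tiling(matches)
```

A greedy pass like this is fast but can leave atoms uncovered when an early choice blocks a better combination, which is the kind of case a later refinement step would need to repair.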
